regularized erm
Regularized ERM on random subspaces
Della Vecchia, Andrea, De Vito, Ernesto, Rosasco, Lorenzo
We study a natural extension of classical empirical risk minimization, where the hypothesis space is a random subspace of a given space. In particular, we consider possibly data dependent subspaces spanned by a random subset of the data, recovering as a special case Nystrom approaches for kernel methods. Considering random subspaces naturally leads to computational savings, but the question is whether the corresponding learning accuracy is degraded. These statistical-computational tradeoffs have been recently explored for the least squares loss and self-concordant loss functions, such as the logistic loss. Here, we work to extend these results to convex Lipschitz loss functions, that might not be smooth, such as the hinge loss used in support vector machines. This unified analysis requires developing new proofs, that use different technical tools, such as sub-gaussian inputs, to achieve fast rates. Our main results show the existence of different settings, depending on how hard the learning problem is, for which computational efficiency can be improved with no loss in performance.
Online learning with dynamics: A minimax perspective
Bhatia, Kush, Sridharan, Karthik
We study the problem of online learning with dynamics, where a learner interacts with a stateful environment over multiple rounds. In each round of the interaction, the learner selects a policy to deploy and incurs a cost that depends on both the chosen policy and current state of the world. The state-evolution dynamics and the costs are allowed to be time-varying, in a possibly adversarial way. In this setting, we study the problem of minimizing policy regret and provide non-constructive upper bounds on the minimax rate for the problem. Our main results provide sufficient conditions for online learnability for this setup with corresponding rates. The rates are characterized by 1) a complexity term capturing the expressiveness of the underlying policy class under the dynamics of state change, and 2) a dynamics stability term measuring the deviation of the instantaneous loss from a certain counterfactual loss. Further, we provide matching lower bounds which show that both the complexity terms are indeed necessary. Our approach provides a unifying analysis that recovers regret bounds for several well studied problems including online learning with memory, online control of linear quadratic regulators, online Markov decision processes, and tracking adversarial targets. In addition, we show how our tools help obtain tight regret bounds for a new problems (with non-linear dynamics and non-convex losses) for which such bounds were not known prior to our work.
Everything old is new again: A multi-view learning approach to learning using privileged information and distillation
Transferring knowledge learned by a powerful model ("teacher") to a simpler model ("student") has become a theme in machine learning. The goal of the knowledge transfer is to have the teacher guide the learning process of the student, so as to achieve high prediction accuracy, or to reduce the sample complexity, which are otherwise hard for the student to achieve by itself. This learning paradigm is practically useful when it is necessary to deploy simpler models to real-world systems, which requires small memory footage or fast processing time. We focus on two specific settings of knowledge transfer in this work. The first one is learning using privileged information (LUPI) [Vapnik and Vashist, 2009], in which the teacher provides an additional set of feature representation to the student during its training process but not the test time, and the extra feature set contains richer information to make the learning problem easier for the student; an example is that the "student may normally only have access to the image of a biopsy to predict the existence of cancer, but during the training process, it also has access to the medical report of an oncologist" [Lopez-Paz et al., 2015]. The second setting is distillation [Ba and Caruana, 2014, Hinton et al.,
Cost Sensitive Learning in the Presence of Symmetric Label Noise
Tripathi, Sandhya, Hemachandra, N.
In binary classification framework, we are interested in making cost sensitive label predictions in the presence of uniform/symmetric label noise. We first observe that $0$-$1$ Bayes classifiers are not (uniform) noise robust in cost sensitive setting. To circumvent this impossibility result, we present two schemes; unlike the existing methods, our schemes do not require noise rate. The first one uses $\alpha$-weighted $\gamma$-uneven margin squared loss function, $l_{\alpha, usq}$, which can handle cost sensitivity arising due to domain requirement (using user given $\alpha$) or class imbalance (by tuning $\gamma$) or both. However, we observe that $l_{\alpha, usq}$ Bayes classifiers are also not cost sensitive and noise robust. We show that regularized ERM of this loss function over the class of linear classifiers yields a cost sensitive uniform noise robust classifier as a solution of a system of linear equations. We also provide a performance bound for this classifier. The second scheme that we propose is a re-sampling based scheme that exploits the special structure of the uniform noise models and uses in-class probability estimates. Our computational experiments on some UCI datasets with class imbalance show that classifiers of our two schemes are on par with the existing methods and in fact better in some cases w.r.t. Accuracy and Arithmetic Mean, without using/tuning noise rate. We also consider other cost sensitive performance measures viz., F measure and Weighted Cost for evaluation.